Red Wine Data set exploration by Arpit Sharma

Abstract

The objective of this analysis is to understand how the various chemical features of red wine relate to its quality rating. I will start by exploring the relationships among the variables and then try to understand how these features impact wine quality.

So, let’s start exploring the wine data set, which has 1599 observations of 12 variables: 11 chemical properties of the wine plus a quality rating. The X column in the output below is only a row index.

Data Set Link : https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityReds.csv

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000
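The listing above has the shape of what `str()` and `summary()` print. A minimal sketch of how it could be reproduced follows. The helper name `load_wine` and its `drop_index` argument are illustrative, and dropping the `X` row index before summarising is an assumption based on its absence from the summary table.

```r
# URL taken from the "Data Set Link" line above.
wine_url <- "https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityReds.csv"

# Hypothetical helper: read the CSV and optionally drop the row-index column X.
load_wine <- function(path = wine_url, drop_index = TRUE) {
  winedata <- read.csv(path)
  if (drop_index) {
    winedata$X <- NULL  # X only numbers the rows; it is not a wine property
  }
  winedata
}

# Usage (needs network access, so it is left commented out):
# str(load_wine(drop_index = FALSE))  # structure, including the X column
# summary(load_wine())                # min, quartiles, mean and max per column
```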

Observations from the Summary

Univariate Plots Section

To explore the data visually, let’s create some plots.

Histogram of all features

Boxplot of all features

The plots above suggest the following:

  • Density and pH appear approximately normally distributed, so they need no transformation.
  • Residual sugar, chlorides, the sulfur dioxide variables and sulphates are long-tailed.
  • Residual sugar and chlorides have extreme outliers.
  • Fixed acidity, residual sugar, free sulfur dioxide, total sulfur dioxide, sulphates and alcohol are positively skewed.
  • Quality looks roughly normal, with the majority of observations rated 5 or 6.

Let’s rescale the skewed variables toward a more normal shape. Skewed and long-tailed data can be transformed with a square root or a logarithm. Here I apply a log transformation to the skewed and long-tailed variables.
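To illustrate why a log transformation helps with a long right tail, here is a small base R sketch. It uses only the first ten `residual.sugar` values shown in the `str()` output, so the numbers describe that sample and not the full data set. The `skewness` helper is an illustrative moment-based definition, not a function from the original analysis.

```r
# First ten residual.sugar values, copied from the str() output above.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# Moment-based sample skewness: 0 for symmetric data, > 0 for a long right tail.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw <- skewness(residual_sugar)
skew_log <- skewness(log10(residual_sugar))

# The log scale pulls in the large value (6.1), so skew_log is smaller than skew_raw.
print(c(raw = skew_raw, log10 = skew_log))
```

Note that `log10(0)` is `-Inf`, so a variable containing zeros cannot be log-transformed directly; something like `log10(x + 1)` or a square root would be needed for those observations.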

Histogram - Fixed acidity, volatile acidity and citric acid

After the log transformation, fixed acidity and volatile acidity look almost normal, although volatile acidity appears slightly bimodal.

Citric acid is not normal even after the log transformation. It has a lot of zero values, and the majority of its values fall between 0.2 and 0.8.

Histogram - Residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, sulphates and alcohol

Chlorides, total sulfur dioxide and sulphates appear to be normally distributed after the logarithmic transformation.

Residual sugar looks almost normal after the log transformation.

Alcohol and free sulfur dioxide appear to be bimodal.

Convert the quality variable from an integer to an ordered factor
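A minimal sketch of this conversion in base R is shown below. It uses the first ten `quality` values from the `str()` output. The level range 3 to 8 is taken from the minimum and maximum in the summary table, and the variable name `quality_factor` is illustrative.

```r
# First ten quality ratings, copied from the str() output above.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Ordered factor: the levels keep their ranking, so comparisons such as < work.
quality_factor <- factor(quality, levels = 3:8, ordered = TRUE)

print(table(quality_factor))              # counts per rating, including empty levels
print(quality_factor[4] > quality_factor[1])  # a 6 ranks above a 5
```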

Histogram - Wine Quality

Univariate Analysis

What is the structure of your dataset?

The dataset contains 1599 observations of red wines with 12 variables (plus the row index X). Eleven of them are numerical chemical properties, and quality is an integer rating that I treat as a categorical variable. There are no NA values in the dataset.
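One way to check for missing values and column types is sketched below on a two-column sample built from the first ten rows of the `str()` output. The data frame name `sample_rows` is illustrative; on the full data the same calls would be applied to the complete data frame.

```r
# Two columns from the first ten rows shown in the str() output above.
sample_rows <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

na_per_column <- colSums(is.na(sample_rows))  # number of NA values in each column
print(na_per_column)
print(anyNA(sample_rows))                     # TRUE if any value is missing
print(sapply(sample_rows, class))             # storage type of each column
```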

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the data is quality, which is the output variable. The objective is to determine the relationship between the explanatory variables and quality.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I expect variables such as fixed.acidity, volatile.acidity, citric.acid and alcohol to be the main predictors of wine quality. These variables may support my investigation; however, I should gain more insight once I draw the bivariate plots.

Did you create any new variables from existing variables in the dataset?

Yes. I converted quality (the output variable) to an ordered factor. I also created a quality rating bucket that groups quality into poor, good and excellent.
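The bucketing could be sketched with `cut()` as below. The cut points are an assumption: the report only states that ratings 5 and 6 count as good, so ratings below 5 are taken as poor and ratings above 6 as excellent. The function name `rate_quality` is illustrative.

```r
# Hypothetical helper: map integer quality ratings to three ordered buckets.
# Assumed cut points: <= 4 Poor, 5-6 Good, >= 7 Excellent.
rate_quality <- function(quality) {
  cut(quality,
      breaks = c(-Inf, 4, 6, Inf),
      labels = c("Poor", "Good", "Excellent"),
      ordered_result = TRUE)
}

# Show the bucket for every rating that occurs in the data (3 to 8).
print(data.frame(quality = 3:8, quality_rating = rate_quality(3:8)))
```

`cut()` needs numeric input, so it should be applied to the original integer ratings rather than to the factor version of quality.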

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

I noticed that the distribution of citric acid is unusual. Even after applying the log transformation, it is not normal.

Aside from this, the distributions of volatile acidity, alcohol and free sulfur dioxide seem to be bimodal.

Some of the distributions were affected by outliers, so I applied a log transformation to them. They look closer to normal after the transformation.

Bivariate Plots Section

Correlation (ggcorr) - Wine data set variables

Correlation (cor) - Wine data set variables

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity           1.00000000     -0.256130895  0.67170343
## volatile.acidity       -0.25613089      1.000000000 -0.55249568
## citric.acid             0.67170343     -0.552495685  1.00000000
## residual.sugar          0.11477672      0.001917882  0.14357716
## chlorides               0.09370519      0.061297772  0.20382291
## free.sulfur.dioxide    -0.15379419     -0.010503827 -0.06097813
## total.sulfur.dioxide   -0.11318144      0.076470005  0.03553302
## density                 0.66804729      0.022026232  0.36494718
## pH                     -0.68297819      0.234937294 -0.54190414
## sulphates               0.18300566     -0.260986685  0.31277004
## alcohol                -0.06166827     -0.202288027  0.10990325
## quality                 0.12405165     -0.390557780  0.22637251
##                      residual.sugar    chlorides free.sulfur.dioxide
## fixed.acidity           0.114776724  0.093705186        -0.153794193
## volatile.acidity        0.001917882  0.061297772        -0.010503827
## citric.acid             0.143577162  0.203822914        -0.060978129
## residual.sugar          1.000000000  0.055609535         0.187048995
## chlorides               0.055609535  1.000000000         0.005562147
## free.sulfur.dioxide     0.187048995  0.005562147         1.000000000
## total.sulfur.dioxide    0.203027882  0.047400468         0.667666450
## density                 0.355283371  0.200632327        -0.021945831
## pH                     -0.085652422 -0.265026131         0.070377499
## sulphates               0.005527121  0.371260481         0.051657572
## alcohol                 0.042075437 -0.221140545        -0.069408354
## quality                 0.013731637 -0.128906560        -0.050656057
##                      total.sulfur.dioxide     density          pH
## fixed.acidity                 -0.11318144  0.66804729 -0.68297819
## volatile.acidity               0.07647000  0.02202623  0.23493729
## citric.acid                    0.03553302  0.36494718 -0.54190414
## residual.sugar                 0.20302788  0.35528337 -0.08565242
## chlorides                      0.04740047  0.20063233 -0.26502613
## free.sulfur.dioxide            0.66766645 -0.02194583  0.07037750
## total.sulfur.dioxide           1.00000000  0.07126948 -0.06649456
## density                        0.07126948  1.00000000 -0.34169933
## pH                            -0.06649456 -0.34169933  1.00000000
## sulphates                      0.04294684  0.14850641 -0.19664760
## alcohol                       -0.20565394 -0.49617977  0.20563251
## quality                       -0.18510029 -0.17491923 -0.05773139
##                         sulphates     alcohol     quality
## fixed.acidity         0.183005664 -0.06166827  0.12405165
## volatile.acidity     -0.260986685 -0.20228803 -0.39055778
## citric.acid           0.312770044  0.10990325  0.22637251
## residual.sugar        0.005527121  0.04207544  0.01373164
## chlorides             0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide   0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide  0.042946836 -0.20565394 -0.18510029
## density               0.148506412 -0.49617977 -0.17491923
## pH                   -0.196647602  0.20563251 -0.05773139
## sulphates             1.000000000  0.09359475  0.25139708
## alcohol               0.093594750  1.00000000  0.47616632
## quality               0.251397079  0.47616632  1.00000000
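A matrix like the one above is what `cor()` returns for a data frame of numeric columns. The sketch below runs it on four columns from the first ten rows of the `str()` output, so its coefficients will not match the full-data values printed above. The name `sample_rows` is illustrative.

```r
# Four columns from the first ten rows shown in the str() output.
sample_rows <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality       = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation between every pair of columns (the default method).
cor_matrix <- cor(sample_rows)
print(round(cor_matrix, 2))
```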

So, let’s use ggplot to further examine the variables that are most strongly correlated with each other.

ggpairs - Wine data set variables

Based on the correlation matrix and the ggpairs plots, no pair of variables is very strongly correlated; the largest absolute coefficient is about 0.68. However, several variables are moderately correlated with each other. Let’s examine the relationships between those variables using bivariate plots.

  • The four variables most correlated with quality are alcohol (0.476), volatile.acidity (-0.391), sulphates (0.251) and citric.acid (0.226). Of these, only volatile.acidity is negatively correlated.

  • Fixed acidity is correlated with citric acid (0.672), density (0.668) and pH (-0.683, a negative correlation).

  • Density is negatively correlated with alcohol content (-0.496).
  • Sulphates and chlorides are moderately positively correlated (0.371).
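The ranking of variables by their correlation with quality can be reproduced from the quality column of the correlation matrix printed above. The sketch below copies those coefficients (rounded to three decimals) into a named vector and sorts them by absolute value. The names `quality_cor` and `ranked` are illustrative.

```r
# Correlations with quality, rounded from the matrix printed above.
quality_cor <- c(
  fixed.acidity        =  0.124,
  volatile.acidity     = -0.391,
  citric.acid          =  0.226,
  residual.sugar       =  0.014,
  chlorides            = -0.129,
  free.sulfur.dioxide  = -0.051,
  total.sulfur.dioxide = -0.185,
  density              = -0.175,
  pH                   = -0.058,
  sulphates            =  0.251,
  alcohol              =  0.476
)

# Sort by strength of association, ignoring the sign.
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
print(head(ranked, 4))
```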

Alcohol and Quality - Scatterplot

Alcohol and Quality - Box Plot

From the plots above, we can infer that quality ratings go up with alcohol content. This is especially true for excellent quality wines.

Alcohol and Density

There is a moderate negative correlation between alcohol and density (about -0.50), so wines with higher alcohol content tend to have lower density.

pH and Fixed.acidity

pH and fixed.acidity have a fairly strong negative correlation (about -0.68).

volatile.acidity and quality

There is a moderate negative correlation between volatile acidity and quality (about -0.39). Red wines with volatile acidity below 0.4 tend to have excellent quality.

Sulphates and Chlorides

There are a lot of outliers in the data. Looking at the plot, these two variables do not seem to have a very strong relationship.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

  • Alcohol and quality have a moderate positive correlation (around 0.476), so wines with higher alcohol content tend to be of better quality.
  • Volatile acidity has a negative correlation with quality and a positive correlation with pH.
  • Quality seems to go up when volatile.acidity goes down. Red wines with volatile acidity below 0.4 tend to have excellent quality. Better quality wines also tend to have lower densities.
  • Fixed.acidity seems to have little or no effect on quality.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

  • total.sulfur.dioxide and free.sulfur.dioxide are fairly strongly correlated (about 0.67), but these are not among our main features of interest.

  • Sulphates and chlorides seem to be moderately positively correlated.

  • pH and density have a weak negative correlation (about -0.34), so when density increases, pH tends to decrease.

What was the strongest relationship you found?

The strongest relationship is the negative correlation between fixed.acidity and pH (about -0.68).
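This can be checked against the printed coefficients. The sketch below builds a sub-matrix of four variables from the correlation matrix above (values rounded to four decimals) and searches it for the largest off-diagonal absolute correlation. The helper name `strongest_pair` is illustrative; the full `cor()` matrix could be passed to it in the same way.

```r
vars <- c("fixed.acidity", "citric.acid", "density", "pH")

# Sub-matrix copied (rounded) from the correlation matrix printed earlier.
m <- matrix(c( 1.0000,  0.6717,  0.6680, -0.6830,
               0.6717,  1.0000,  0.3649, -0.5419,
               0.6680,  0.3649,  1.0000, -0.3417,
              -0.6830, -0.5419, -0.3417,  1.0000),
            nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Hypothetical helper: find the pair with the largest absolute correlation.
strongest_pair <- function(m) {
  m[upper.tri(m, diag = TRUE)] <- 0           # keep each pair once, drop the diagonal
  pos <- arrayInd(which.max(abs(m)), dim(m))  # row and column of the largest |r|
  list(pair = c(rownames(m)[pos[1]], colnames(m)[pos[2]]),
       r    = m[pos[1], pos[2]])
}

res <- strongest_pair(m)
print(res)
```

For comparison, the free.sulfur.dioxide and total.sulfur.dioxide pair, which is not in this sub-matrix, has a coefficient of about 0.668 in the full matrix.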

Multivariate Plots Section

Fixed acidity and citric acid with quality

It seems that increases in citric acid and fixed.acidity have no significant impact on wine quality.

Alcohol and Density with quality

It seems that lower-density wines with higher alcohol content tend to be of better quality.

Alcohol and volatile acidity with quality

Alcohol and pH with quality

From the above plot, it seems that lower pH and higher alcohol content make for better wine.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

  • The plots support our earlier hypothesis that wines with higher alcohol content and lower density tend to be of better quality.
  • Higher volatile acidity goes with poorer wine quality. Excellent quality wines tend to have lower volatile acidity.
  • Lower pH and higher alcohol content seem to make for better wine.

Were there any interesting or surprising interactions between features?

  • The multivariate plot showed that although citric acid and fixed.acidity are strongly correlated with each other, they have only a weak impact on wine quality.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

None


Final Plots and Summary

Plot One

Wine Quality

The wine quality data looks roughly normal. However, around 80% of the red wines are rated 5 or 6, i.e. good quality wines by the criteria stated above. So this data is biased towards good quality wine, and poor and excellent quality wines are under-represented. We could also change the criteria used to define poor, good and excellent quality wines, but this needs further investigation of the dataset.

Plot Two

Summary statistics for above plot

## winedata$quality_rating: Poor
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.60   10.00   10.22   11.00   13.10 
## -------------------------------------------------------- 
## winedata$quality_rating: Good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.00   10.25   10.90   14.90 
## -------------------------------------------------------- 
## winedata$quality_rating: Excellent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.60   11.52   12.20   14.00
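The listing above has the shape that `by()` prints when `summary` is applied per group. A minimal sketch follows, using two columns from the first ten rows of the `str()` output, so its numbers differ from the full-data figures above and the sample contains no poor wines. The bucket cut points are an assumption (ratings up to 4 poor, 5 to 6 good, 7 and above excellent).

```r
# Two columns from the first ten rows shown in the str() output.
winedata <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Assumed bucketing of the integer ratings.
winedata$quality_rating <- cut(winedata$quality,
                               breaks = c(-Inf, 4, 6, Inf),
                               labels = c("Poor", "Good", "Excellent"))

# Summary of alcohol within each rating bucket.
print(by(winedata$alcohol, winedata$quality_rating, summary))

# The group means alone, as a named vector.
group_means <- tapply(winedata$alcohol, winedata$quality_rating, mean)
print(group_means)
```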

Relationship between quality and alcohol

The above is a box plot of alcohol by quality rating. Alcohol has the strongest correlation with quality, around 0.476. As the box plot shows, high quality wines have higher alcohol content on average.

From the summary statistics above, we can infer that the average alcohol content of excellent quality wines is about 11.5%, while good and poor quality wines average 10.25% and 10.22% respectively. The box plot also shows that there is not much difference in alcohol content between poor and good quality wines, although there seem to be many outliers among the good quality wines.

Plot Three

Description Three

The above plot describes the effect of alcohol percentage and wine density on wine quality. The higher the alcohol percentage, the lower the density. This visualization also supports our earlier hypothesis that wines with higher alcohol content and lower density tend to be of better quality.

Reflection

I have limited experience with R, so this analysis was challenging for me. At the same time it was quite rewarding, as it gave me the opportunity to explore the entire wine data set and to create visualizations that reveal patterns in the data.

Through this exploratory data analysis, I was able to identify key factors, such as alcohol content, sulphates and acidity, that contribute to wine quality.

In the beginning, I had no idea that alcohol content has more influence on wine quality than the other parameters, but the univariate, bivariate and multivariate analyses helped me reach this insight. So, this was a surprising result for me.

Had I had more time, I would have fitted a regression model to the data to gain more insight into wine quality and its relationship with the other variables.